pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, tidytext, tidyverse, skimr, patchwork, ggdist, ggridges, ggthemes, scales)
Shaun Tan
June 3, 2023
June 18, 2023
The country of Oceanus has sought FishEye International’s help in identifying companies possibly engaged in illegal, unreported, and unregulated (IUU) fishing. As part of the collaboration, FishEye’s analysts received import/export data for Oceanus’ marine and fishing industries. However, Oceanus has informed FishEye that the data is incomplete. To facilitate their analysis, FishEye transformed the trade data into a knowledge graph. Using this knowledge graph, they hope to understand business relationships, including finding links that will help them stop IUU fishing and protect marine species that are affected by it. FishEye analysts found that node-link diagrams gave them a good high-level overview of the knowledge graph. However, they are now looking for visualizations that provide more detail about patterns for entities in the knowledge graph.
Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
  # distinct() %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)

| Name | mc3_edges |
|---|---|
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
| Name | mc3_nodes |
|---|---|
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
Exploring Country of Origin
plot_company <- nodes_country %>%
filter(type == "Company" &
count > 150) %>%
ggplot(aes(x = reorder(country, -count), y = count)) +
geom_col() +
ylim(0,4000) +
geom_text(
aes(label = count),
vjust = -2
) +
labs(
title = "Count of Company's Country of Origin", y= "Count", x = "Country", subtitle = "Companies predominantly from ZH, Oceanus, and Marebak"
)
plot_owner <- nodes_country %>%
filter(type == "Beneficial Owner") %>%
ggplot(aes(x = reorder(country, -count), y = count)) +
geom_col() +
ylim(0,14000) +
geom_text(
aes(label = count),
vjust = -2
) +
labs(
title = "Count of Beneficial Owner's Country of Origin", y= "Count", x = "Country", subtitle = "Beneficial Owners predominantly from ZH"
)
plot_contacts <- nodes_country %>%
filter(type == "Company Contacts") %>%
ggplot(aes(x = reorder(country, -count), y = count)) +
geom_col() +
ylim(0,8000) +
geom_text(
aes(label = count),
vjust = -2
) +
labs(
title = "Count of Company Contacts' Country of Origin", y= "Count", x = "Country", subtitle = "Company Contacts predominantly from ZH"
)
plot_company / plot_owner / plot_contacts
Although most beneficial owners and company contacts originate from ZH, the companies' countries of origin are more diverse, with ZH, Oceanus, and Marebak taking the top three spots. This suggests that owners venture beyond their home countries to set up companies elsewhere.
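The three bar charts above read from `nodes_country`, whose construction is not shown in this write-up. A minimal sketch of how such a table could be derived, assuming it holds one row per node type and country with a `count` column (the toy ids and countries below are made up for illustration):

```r
library(dplyr)

# Toy stand-in for mc3_nodes; ids and countries are invented.
toy_nodes <- tibble(
  id      = c("A Ltd", "B Ltd", "C Ltd", "Jane Doe", "John Roe"),
  country = c("ZH", "ZH", "Oceanus", "ZH", "ZH"),
  type    = c("Company", "Company", "Company",
              "Beneficial Owner", "Company Contacts")
)

# One row per (type, country) pair with the number of nodes in it,
# largest counts first within each type.
nodes_country <- toy_nodes %>%
  count(type, country, name = "count") %>%
  arrange(type, desc(count))

nodes_country
```

With the real data, `toy_nodes` would be replaced by `mc3_nodes`; whether duplicate ids should be removed first is a separate decision that this sketch does not make.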
Exploring revenue of companies
company_revenue <- mc3_nodes %>%
filter(type == "Company")
ggplot(company_revenue,
aes(y = revenue_omu)) +
scale_y_continuous(
limits = c(0, 200000),
breaks = pretty_breaks(n = 5),
labels = dollar_format())+
geom_boxplot(width = 0.5,
outlier.shape = NA, color = 'darkred') +
stat_dots(color = 'blue') +
coord_flip() +
labs(
title = "Distribution of Revenue of Companies", y= "Revenue", x = "Count", subtitle = "Highly right skewed distribution of companies' revenue"
)
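The right skew claimed in the subtitle can also be checked with simple arithmetic on the skim summary of `revenue_omu` shown earlier. This is only a rough check: that summary covers every node with a reported revenue, not just those of type "Company".

```r
# Values copied from the skim summary of revenue_omu above.
revenue_mean   <- 1822155
revenue_median <- 16210.68
revenue_p75    <- 48327.66

# In a right-skewed distribution the mean is pulled far above the median.
mean_to_median <- revenue_mean / revenue_median

# Here the mean even exceeds the 75th percentile.
mean_above_p75 <- revenue_mean > revenue_p75

round(mean_to_median)
mean_above_p75
```

A mean more than a hundred times the median indicates that a small number of very large revenues dominate the average, which is why the plot limits the axis to 200,000.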
Exploring the relationship between owners and companies:
Getting the number of owners each company has:
The table above displays the number of owners each company has. It is postulated that a company with multiple beneficial owners is subject to the oversight of many people and is therefore less likely to engage in dubious activities, whereas a sole proprietorship is at the bidding of its single beneficial owner.
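The code that builds the owner counts is not shown here. One way it could be done is sketched below, assuming (as elsewhere in this post) that `source` holds the company and `target` the person, and that `owner_count_df` has the columns `owner_count` and `count` used in the plot that follows. The toy edges are invented.

```r
library(dplyr)

# Toy stand-in for mc3_edges: source = company, target = person.
toy_edges <- tibble(
  source = c("A Ltd", "A Ltd", "B Ltd", "C Ltd", "C Ltd", "C Ltd"),
  target = c("P1", "P2", "P3", "P4", "P5", "P6"),
  type   = c("Beneficial Owner", "Beneficial Owner", "Beneficial Owner",
             "Beneficial Owner", "Company Contacts", "Company Contacts")
)

# Step 1: number of distinct beneficial owners per company.
owners_per_company <- toy_edges %>%
  filter(type == "Beneficial Owner") %>%
  distinct(source, target) %>%
  count(source, name = "owner_count")

# Step 2: number of companies at each owner count.
owner_count_df <- owners_per_company %>%
  count(owner_count, name = "count")

owner_count_df
```

Applying `distinct()` before counting keeps a repeated company-owner edge from inflating the owner count.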
filtered_data <- owner_count_df %>%
filter(owner_count <= quantile(owner_count, 0.2))
ggplot(filtered_data, aes(x = factor(owner_count), y = count)) +
geom_col() +
ylim(0,7000) +
geom_text(
aes(label = count),
vjust = -1,
size = 3
) +
labs(
title = "Count of Companies by number of owners", y= "Count of Companies", x = "Number of Owners", subtitle = "Majority of companies have only one owner"
)
The above graph shows that 6415 companies are sole proprietorships.
Conclusion of EDA
Initial Sensing:
Revenue is a good place to start to explore anomalous behaviour:
The first anomalous behaviour to be investigated is sole beneficial owners of multiple companies, with higher-revenue companies being more suspicious. The reasoning is that such companies have no need to be transparent or accountable to other shareholders. With less oversight, they are less deterred from pursuing illegal activities.
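The object `single_bowner_count` used below is not defined in this write-up. A hedged sketch of the idea follows: first keep companies with exactly one beneficial owner, then keep owners who appear as the sole owner of several such companies. The toy edges and the `min_companies` threshold are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-in for the beneficial-owner edges (source = company).
toy_edges <- tibble(
  source = c("A Ltd", "B Ltd", "C Ltd", "C Ltd", "D Ltd"),
  target = c("P1", "P1", "P2", "P3", "P4"),
  type   = "Beneficial Owner"
)

# Companies with exactly one beneficial owner.
sole_owner_edges <- toy_edges %>%
  filter(type == "Beneficial Owner") %>%
  distinct(source, target) %>%
  group_by(source) %>%
  filter(n() == 1) %>%
  ungroup()

# Owners who are the sole owner of at least `min_companies` companies.
min_companies <- 2  # hypothetical threshold
multi_company_sole_owners <- sole_owner_edges %>%
  count(target, name = "n_companies") %>%
  filter(n_companies >= min_companies)

multi_company_sole_owners
```

In the toy data only P1 qualifies: C Ltd has two owners and is dropped in the first step, and P4 solely owns just one company.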
single_bowner_count_revenue <- left_join(single_bowner_count, mc3_nodes, by = c("source"="id")) %>%
select(-type.y) %>%
rename("type" = "type.x")
single_bowner_count_revenue1 <- single_bowner_count_revenue %>%
distinct() %>%
rename("from" = "source",
"to" = "target")
bowner_source <- single_bowner_count_revenue1 %>%
distinct(from) %>%
rename("id" = "from")
bowner_target <- single_bowner_count_revenue1 %>%
distinct(to) %>%
rename("id" = "to")
bowner_nodes_extracted <- rbind(bowner_source, bowner_target)
bowner_nodes_extracted$group <- ifelse(bowner_nodes_extracted$id %in% single_bowner_count_revenue$source, "Company", "Beneficial Owner")
Creating the visNetwork Graph
visNetwork(
bowner_nodes_extracted,
single_bowner_count_revenue1
) %>%
visIgraphLayout(
layout = "layout_with_fr"
) %>%
visGroups(groupname = "Company",
color = "lightblue") %>%
visGroups(groupname = "Beneficial Owner",
color = "yellow") %>%
visLegend() %>%
visEdges(
arrows = "to"
) %>%
visOptions(
highlightNearest = list(enabled = T, degree = 2, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group",
collapse = TRUE)
With the possibility of lesser oversight, there is a chance that these companies are participating in suspicious activity. However, it should be noted that the size and the nature of business of these companies are not available on the network graph. They should therefore be investigated in greater detail before a solid conclusion can be formed. In the meantime, this ownership pattern is the exception rather than the norm and should be monitored.
The next suspicious behaviour that deserves investigation is companies with many company contacts but no reported revenue. There is a chance that these companies are in fact bringing in substantial revenue but have not declared it.
# Extract nodes that have unreported revenue
nodes_norev <- mc3_nodes %>%
filter(is.na(revenue_omu))
nodes_norev_compcontact <- nodes_norev %>%
filter(type == "Company Contacts") %>%
distinct()
# Extracting edges that are company contacts
edges_norev <- mc3_edges %>%
filter(type == "Company Contacts") %>%
filter(source %in% nodes_norev_compcontact$id) %>%
distinct() %>%
rename("from" = "source",
"to" = "target")
# Extract edges that have more than or equal to 3 company contacts
edges_norev_high <- edges_norev %>%
group_by(from) %>%
mutate(count = n()) %>%
filter(count >= 3) %>%
ungroup()
Creating the visNetwork graph
visNetwork(
nodes_norev1,
edges_norev_high
) %>%
visIgraphLayout(
layout = "layout_with_fr"
) %>%
visGroups(groupname = "Company",
color = "lightblue") %>%
visGroups(groupname = "Company Contact",
color = "yellow") %>%
visLegend() %>%
visEdges(
arrows = "to"
) %>%
visOptions(
highlightNearest = list(enabled = T, degree = 2, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group",
collapse = TRUE)
It is indeed suspicious that so many large companies have unreported revenue. Given the lack of revenue data, their postulated size can only be extrapolated from the number of contacts and the number of beneficial owners. It should be noted that the number of contacts of these top few companies is similar to that of the top few companies with reported revenue. This should be investigated in further detail, with other forms of proxy information pieced together, to determine what they could be hiding.
The last visual that I will use to explore specifically fish-related anomalous behaviour is the network of the biggest company-beneficial owner relationships among fish-related businesses. The aim is to understand the typical network size of a fish-related business in terms of number of beneficial owners, and to compare it with the industry standard.
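The comparison with the industry standard can also be made numerically. The sketch below is one possible approach, not the method used for the graph that follows: it counts beneficial owners per company and summarises them separately for fish-related and other companies. The toy nodes and edges are invented, and "fish-related" is assumed to mean that `product_services` contains the word "fish".

```r
library(dplyr)

# Toy stand-ins for mc3_nodes and mc3_edges.
toy_nodes <- tibble(
  id               = c("A Ltd", "B Ltd", "C Ltd"),
  type             = "Company",
  product_services = c("Fish and seafood products", "Steel pipes",
                       "Frozen fish fillets")
)
toy_edges <- tibble(
  source = c("A Ltd", "A Ltd", "A Ltd", "B Ltd", "C Ltd"),
  target = c("P1", "P2", "P3", "P4", "P5"),
  type   = "Beneficial Owner"
)

# Owners per company, tagged as fish-related or not.
owner_counts <- toy_edges %>%
  filter(type == "Beneficial Owner") %>%
  distinct(source, target) %>%
  count(source, name = "owner_count") %>%
  # Keep one node row per id so the join does not duplicate companies.
  left_join(distinct(toy_nodes, id, .keep_all = TRUE),
            by = c("source" = "id")) %>%
  mutate(fish_related = grepl("fish", product_services, ignore.case = TRUE))

# Summary per group: fish-related versus everything else.
comparison <- owner_counts %>%
  group_by(fish_related) %>%
  summarise(companies     = n(),
            median_owners = median(owner_count),
            max_owners    = max(owner_count))

comparison
```

The `distinct()` call on the nodes matters for the real data, since the skim summary above shows fewer unique ids than rows in `mc3_nodes`.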
# Extract nodes that are fish-related
fish_nodes <- mc3_nodes %>%
filter(grepl("fish", product_services, ignore.case = TRUE))
fish_nodes_bowners <- fish_nodes %>%
filter(type == "Beneficial Owner") %>%
distinct()
fish_nodes_companies <-fish_nodes %>%
filter(type == "Company") %>%
distinct()
# Extract edges that are fish related
edges_fish <- mc3_edges %>%
filter(type %in% c("Company", "Beneficial Owner")) %>%
filter(source %in% fish_nodes$id) %>%
distinct() %>%
rename("from" = "source",
"to" = "target")
# Extract edges that have more than or equal to 8 links
edges_fish_high <- edges_fish %>%
group_by(from) %>%
mutate(count = n()) %>%
filter(count >= 8) %>%
ungroup()
Creating the visNetwork Graph
visNetwork(
nodes_fish1,
edges_fish
) %>%
visPhysics(solver = "forceAtlas2Based",
forceAtlas2Based = list(gravitationalConstant = -100)) %>%
visIgraphLayout(
layout = "layout_with_fr"
) %>%
visGroups(groupname = "Company",
color = "yellow") %>%
visGroups(groupname = "Beneficial Owner",
color = "lightblue") %>%
visLegend() %>%
visEdges(
arrows = "to"
) %>%
visOptions(
highlightNearest = list(enabled = T, degree = 2, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group",
collapse = TRUE)
It is interesting to note that, for these “large” or extensive companies, no individual is a beneficial owner of more than one company. It is possible that these companies do not want to share owners with other companies for fear of a conflict of interest or corporate espionage. While this is not necessarily anomalous behaviour, it is an interesting point to note, and it may even be representative of the different cartels in the fish industry.